System and method for detecting a human face



(57) The present invention relates to a system for 
the processing of video images which include human 
faces. The invention is applicable to a system in which 
the images are generated by a video camera and stored 
in a storage means ready to be processed. 

The system for processing the images includes component analysis means (212,213) to analyse the pixels
of the image to identify a region of connected components in the foreground of the image. An ellipse fitting
means (503,504,505,506,507) performs an iterative ellipse fitting algorithm to fit one or more vertical ellipses
to the connected components in the identified region, each ellipse representing a possible human face. In
order to distinguish between occluded human figures, a plurality of possible models of borders are presented to
separate individual faces in the identified region. Probability computing means (403) perform a computation
of the probability of each model based on the ellipse or ellipses fitted in the identified region. The
parameters of each model are iteratively adjusted to maximise the probability computation for that model, and
a selection is made of the model having the highest probability.
Description

The present invention generally relates to real-time video image analysis, and more specifically to the detection 
of human faces and eyes within real-time video images. 

In recent years, the detection of human faces from video data has become a popular research topic. There are
numerous commercial applications of face detection, such as in face recognition, verification, classification
and identification, as well as security access and multimedia. To extract the human face in an uncontrolled
environment, most prior art techniques attempt to overcome the difficulty of dealing with issues such as
variations in lighting, variations in pose, occlusion of people by other people, and cluttered or non-uniform
backgrounds.

In one prior art face detection technique, an example-based learning approach for locating unoccluded human
frontal faces is used. The approach measures a distance between the local image and a few view-based "face" and
"non face" pattern prototypes at each image location to locate the face. In another technique, the distance to a
"face space", defined by "eigenfaces", is used to locate and track frontal human faces. In yet another prior art
technique, human faces are detected by searching for significant facial features at each location in the image.
Finally, in other techniques, a deformable template based approach is used to detect faces and to extract facial
features.

In addition to the detection of faces within video image sequences, prior art systems have attempted to detect
eyes on human heads. For example, Chellappa et al., "Human and Machine Recognition of Faces: A Survey",
Proceedings of the IEEE, vol. 83, no. 5, pp. 705-740, May 1995, described a process for detecting eyes on a human
head, where the video image includes a front view of the head. For frontal views, eye detection that is based on
geometrical measures has been extensively studied by, for example, Stringa, "Eyes Detection for Face
Recognition", Applied Artificial Intelligence, vol. 7, no. 4, pp. 365-382, Oct.-Dec. 1993 and Brunelli et al.,
"Face Recognition: Features versus Templates", IEEE Transactions on Pattern Analysis and Machine Intelligence,
October 1993. Additionally, Yuille et al., "Feature Extraction from Faces Using Deformable Templates",
International Journal of Computer Vision, vol. 8, pp. 299-311, 1992, describe a deformable template-based
approach to facial feature detection. However, these methods may lead to significant problems in the analysis of
profile or back views. Moreover, the underlying assumption of dealing only with frontal faces is simply not valid
for real-world applications.

There is therefore a significant need in the art for a system that can quickly, reliably and flexibly detect the existence 
of a face or faces within a video image, and that can also extract various features of each face, such as eyes. 

According to the invention there is provided a system for processing a video image comprising pixels
representing a foreground including one or more human faces, the system comprising:

component analysis means to process the pixels of the image to identify a region of connected components in the
foreground of the image,

ellipse fitting means to perform an iterative ellipse fitting algorithm to fit one or more ellipses to the
connected components in the identified region,

means to provide a model of borders for one or more separate individual faces in the identified region,

probability computing means to perform a computation of the probability of the model based on the ellipse or
ellipses fitted in the identified region, and

means to iteratively adjust the model to maximise the probability computation.

The invention will now be described by way of example only with reference to the accompanying drawings:

FIG. 1 is a block diagram of the present invention.

FIG. 2 is a flow diagram depicting the overall operation of the present invention.

FIG. 3 is a flow diagram depicting a process for choosing the most likely model of people within the video image.

FIG. 4 is a flow diagram further depicting the modeling process of FIG. 3.

FIG. 5 is a flow diagram depicting a process for fitting an ellipse around the head of a person detected within a
video image.

FIGS. 6A-6D, 7A-7C, 8A-8D and 9A-9D depict examples of video images that may be processed by the present
invention.

FIG. 10 depicts criteria that may be used to model a face within a video image.

FIGS. 11-12 are flow diagrams depicting processes that are performed by the present invention.

1. The Video System

FIG. 1 depicts the overall structure of the present invention in one embodiment. The hardware components of the
present invention may consist of standard off-the-shelf components. The primary components in the system are one
or more video cameras 110, one or more frame grabbers 120, and a processing system 130, such as a personal
computer (PC). The combination of the PC 130 and frame grabber 120 may collectively be referred to as a "video
processor" 140. The video processor 140 receives a standard video signal format 115, such as RS-170, NTSC, CCIR
or PAL, from one or more of the cameras 110, which can be monochrome or color. In a preferred embodiment, the
camera(s) 110 may be mounted or positioned to view a selected area of interest, such as within a retail
establishment or other suitable location.

The video signal 115 is input to the frame grabber 120. In one embodiment, the frame grabber 120 may comprise
a Meteor Color Frame Grabber, available from Matrox. The frame grabber 120 operates to convert the analog video
signal 115 into a digital image stored within the memory 135, which can be processed by the video processor 140.
For example, in one implementation, the frame grabber 120 may convert the video signal 115 into a 640 x 480 (NTSC)
or 768 x 576 (PAL) color image. The color image may consist of three color planes, commonly referred to as YUV or
YIQ. Each pixel in a color plane may have 8 bits of resolution, which is sufficient for most purposes. Of course,
a variety of other digital image formats and resolutions may be used as well, as will be recognized by one of
ordinary skill.

As representations of the stream of digital images from the camera(s) 110 are sequentially stored in memory 135,
analysis of the video image may begin. All analysis according to the teachings of the present invention may be
performed by the processing system 130, but may also be performed by any other suitable means. Such analysis is
described in further detail below.

2. Overall Process Performed by the Invention 

An overall flow diagram depicting the process performed by the processing system 130 of the present invention
is shown in FIG. 2. The first overall stage 201 performed by the processing system 130 is the detection of one or
more human heads (or equivalent) within the video image from camera 110, which is stored in memory 135, and the
second overall stage 202 is the detection of any eyes associated with the detected human head(s). The output 230
of stages 201-202 may be passed to recognition and classification systems (or the like) for further processing.
The steps performed in stage 201 are described below.

The first steps 212-213 and 216 of the head detection stage 201 perform the segmentation of people in the
foreground regions of the sequence of video images stored in memory 135 over time, which is represented in FIG. 2
as video sequence 211. Such segmentation is accomplished by background modeling (step 216), background subtraction
and thresholding (step 212) and connected component analysis (step 213). Assuming the original image 600 of
FIG. 6A (which may be stored in memory 135, etc.), as shown in FIG. 6B, the result of steps 212 and 213 is a set
of connected regions (blobs) (e.g., blobs 601) which have large deviations from the background image. The
connected components 601 are then filtered, also in step 213, to remove insignificant blobs due to shadow, noise
and lighting variations, resulting in, for example, the blobs 602 in FIG. 6C.

To detect the heads of people whose bodies are occluded, a model-based approach is used (steps 214-215, 217).
In this approach, different foreground models (step 217) may be used for the case where there is one person in a
foreground region and the case where there are two people in a foreground region. The outputs of step 214 are the
probabilities of the input given each of the foreground region models. Step 215 selects the model that best
describes the foreground region by selecting the maximum probability computed in step 214. An example output of
step 215 is shown in FIG. 6D, wherein the ellipse 603 is generated.

The functionality performed by system 130 in steps 214-215 and 217 is illustrated in FIGS. 9A-9D. Each of FIGS.
9A-9D represents a video image that may be created by frame grabber 120 and stored in memory 135 (FIG. 1). FIG.
9A depicts an example foreground region representing one person 901. The one person model (x1, x2) matches the
input data. FIG. 9B depicts the same foreground region modeled as two persons (x1, x2, x3). In this case two
dashed ellipses 911, 912 are fitted but they do not represent the correct location of the head 913. The
probability of the foreground region is computed for each model, as is described later, and the system
automatically selects the model for one person to best describe the foreground region in this case.

FIGS. 9C and 9D depict an example foreground region with two people 902, 903 with occluded bodies. In this
case, the system 130 of the present invention selects the two people model (x1, x2, x3) to best represent the
data. When a single person model is used to describe the foreground region, the large dashed ellipse 921 is
fitted, which does not correspond to the head of either of the people 902, 903. The system does not select the
single person model because the probability of the one person model for the given input data is lower than the
probability of the two person model given the input data.

The next overall stage 202 in the present invention is the detection of eyes from varying poses and the
extraction of those faces that correspond to frontal views. In prior art articles, such as those described by Turk
et al., "Face Recognition Using Eigenfaces", Proceedings on International Conference on Pattern Recognition, 1991
and Brunelli et al., "Face Recognition: Features versus Templates", IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 15, no. 10, October 1993, techniques have been proposed whereby eyes are detected from
frontal views. However, the assumption of frontal view faces is not valid for real world applications.

In the present invention, in steps 221-222 the most significant face features are detected by analyzing the
connected regions of large deviations from facial statistics. Region size and anthropological measure-based
filtering detect the eyes and the frontal faces. Eye detection based upon anthropological measures for frontal
views has been studied in the prior art (see, e.g., Brunelli et al., cited previously). However, such methods can
run into problems in the analysis of profile or back views of faces. In step 223, filtering based on detected
region size is able to remove big connected components corresponding to hair as well as small regions generated by
noise or shadow effects. In step 224, the remaining components are filtered considering the anthropological
features of human eyes for frontal views, and again the output 230 may be passed to another system for further
processing. The eye detection stage 202 of the present invention is described in further detail below.

3. Segmentation of Foreground Regions 

To extract moving objects within the video image stored in memory 135, the background may be modeled as a
texture with the intensity of each point modeled by a Gaussian distribution with mean $\mu$ and variance $\sigma$,
$N_b(\mu,\sigma)$ (step 216). The pixels in the image are classified as foreground if
$p(O(x,y) \mid N_b(\mu,\sigma)) < T$ and as background if $p(O(x,y) \mid N_b(\mu,\sigma)) > T$. The observation
$O(x,y)$ represents the intensity of the pixels at location $(x,y)$, and $T$ is a constant (step 212).

The connectivity analysis (step 213) of the "foreground" pixels generates connected sets of pixels, i.e. sets of
pixels that are adjacent or touching. Each of the above sets of pixels describes a foreground region. Small
foreground regions are assumed to be due to shadow, camera noise and lighting variations and are removed.
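The segmentation of steps 216, 212 and 213 can be sketched as follows. This is a minimal illustration and not the patented implementation: it assumes the background model is supplied as per-pixel `mean` and `sigma` grids, treats `sigma` as a standard deviation, uses 8-connectivity for "adjacent or touching", and the names `foreground_mask`, `connected_regions` and `min_size` are hypothetical.

```python
import math
from collections import deque


def gaussian_pdf(value, mean, sigma):
    """Likelihood p(O | N_b(mean, sigma)) of one pixel intensity."""
    z = (value - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def foreground_mask(image, mean, sigma, T):
    """Step 212: a pixel is foreground when its background likelihood is below T."""
    return [[gaussian_pdf(image[r][c], mean[r][c], sigma[r][c]) < T
             for c in range(len(image[0]))]
            for r in range(len(image))]


def connected_regions(mask, min_size=1):
    """Step 213: group foreground pixels into 8-connected regions, dropping small ones."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            seen[r0][c0] = True
            queue, region = deque([(r0, c0)]), []
            while queue:  # breadth-first flood fill from the seed pixel
                r, c = queue.popleft()
                region.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and mask[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            if len(region) >= min_size:
                regions.append(region)
    return regions
```

Each returned region is a list of (row, column) pixel coordinates; raising `min_size` plays the role of the small-region filter described above.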

4. The Foreground Region Modeling System 

The foreground regions are analyzed in further detail in steps 214-215 and 217 to detect the head. It is known that 
if there is only one head in the image, then it may be detected by finding the upper region in each set of connected 
foreground regions. However, this technique fails when people in an image are occluded by other people. In this case, 
a foreground region may correspond to two or more people, and finding the regions corresponding to heads requires 
a more complicated approach. In the case of partial people occlusion, in which bodies are occluded by other bodies, 
but heads are not occluded, special processing must be performed. 

To determine the head positions in this case, the number of people in each foreground region must be determined.
As shown in FIG. 3, in order to determine the number of people within the video image, $N$ separate models
$\lambda_i$ (301) (where $i$ may equal 1 to $N$) may be built, each model $\lambda_i$ 301 corresponding to $i$
people in a set of connected foreground regions. Based on the assumption that faces are vertical and are not
occluded, the model parameters for model $\lambda_i$ are $(x_0, x_1, \ldots, x_i)$, where $i$ is the number of
people and $x_k$ (where $k$ = 1 to $i$) specifies the horizontal coordinates of the vertical boundaries that
separate the $i$ head regions in model $\lambda_i$. The approach used to determine the number of people in each
foreground region is to select in step 215 the model $\lambda_i$ 301 for which the maximum likelihood is achieved:

$$\lambda = \arg\max_{i \in [1,N]} P(O(x,y) \mid \lambda_i) \tag{1}$$

where the observations $O(x,y)$ are the pixel intensities at coordinates $(x,y)$ in the foreground regions and
$P(O(x,y) \mid \lambda_i)$ is the likelihood function for the $i$-th model 301.

The probability computation steps 302 in FIG. 3 determine the likelihood functions for each model 301. In step
215, the observations $O(x,y)$ in the foreground regions are used to find for each model $\lambda_i$ 301 the
optimal set of parameters $(x_0, x_1, \ldots, x_i)$ that maximize $P(O(x,y) \mid \lambda_i)$, i.e. to find the
parameters $(x_0, x_1, \ldots, x_i)$ that "best" segment the foreground regions (step 215). It will be shown later
that the computation of $P(O(x,y) \mid \lambda_i)$ for each set of model parameters 301 requires an efficient head
detection algorithm inside each rectangular window bordered by $x_{j-1}$ and $x_j$, $j = 1, \ldots, i$.
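The model selection of step 215 amounts to an arg-max over the candidate people counts. The sketch below assumes a hypothetical callable `fitted_likelihood(i)` that returns the likelihood of the $i$-person model after its boundaries have been optimised; the numeric values in the example are made up for illustration.

```python
def select_model(max_people, fitted_likelihood):
    """Pick the people count i in 1..max_people whose fitted model scores highest.

    fitted_likelihood(i) is a hypothetical callable returning P(O | lambda_i)
    for the i-person model with its best boundary parameters.
    """
    scores = {i: fitted_likelihood(i) for i in range(1, max_people + 1)}
    best_i = max(scores, key=scores.get)
    return best_i, scores


# Made-up likelihood values, for illustration only.
example = {1: 0.71, 2: 0.88, 3: 0.64}
best_i, scores = select_model(3, example.get)
```

With these illustrative values the two-person model would be chosen, mirroring the situation of FIGS. 9C and 9D.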

It is common to approximate the support of the human face by an ellipse. In addition, it has been determined that
the ellipse aspect ratio of the human face is, for many situations, invariant to rotations in the image plane as
well as rotations in depth. Based on the above, the head model 301 is parameterized by the set $(x_0, y_0, a, b)$,
where $x_0$ and $y_0$ are the coordinates of the ellipse centroid and $a$ and $b$ are the axes of the ellipse. The
set $(x_0, y_0, a, b)$ is determined through an efficient ellipse fitting process described elsewhere with respect
to FIG. 5.

5. Computation of Foreground Model Likelihood Functions

Based on the assumption that human faces are vertical and are not occluded, it is deemed appropriate to
parameterize models $\lambda_i$ 301 over the set of parameters $(x_0, x_1, \ldots, x_i)$, which are the horizontal
coordinates of the vertical borders that separate individual faces in each foreground region. The set of
parameters $(x_0, x_1, \ldots, x_i)$ is computed iteratively to maximize $P(O(x,y) \mid \lambda_i)$. In a Hidden
Markov Model (HMM) implementation (described further in Rabiner et al., "A Tutorial on Hidden Markov Models and
Selected Applications in Speech Recognition", Proceedings of the IEEE, February 1989), this corresponds to the
training phase in which the model parameters are optimized to best describe the observed data.

To define the likelihood functions $P(O(x,y) \mid \lambda_i)$, a preliminary discussion about the head detection
algorithm may be helpful. In the present invention, the head is determined by fitting an ellipse around the upper
portions of the foreground regions inside each area bounded by $x_{j-1}, x_j$, $j = 1, \ldots, i$. The head
detection problem is reduced to finding the set of parameters $(x_0, y_0, a, b)$ that describe an ellipse type
deformable template (step 402 in FIG. 4). Parameters $x_0$ and $y_0$ describe the ellipse centroid coordinates and
$a$ and $b$ are the ellipse axes. The ellipse fitting algorithm is described in more detail with respect to FIG. 5.

For each set of parameters $(x_0, y_0, a, b)$ a rectangular template ($W$ in FIG. 10) is defined by the set of
parameters $(x_0, y_0, \alpha a, \alpha b)$, where $x_0$ and $y_0$ are the coordinates of the center of the
rectangle, $\alpha a$ and $\alpha b$ are the width and length of the rectangle, and $\alpha$ is some constant (see
FIG. 10). In each area bounded by $x_{j-1}, x_j$, $R_{out,j}$ is the set of pixels outside the ellipse template and
inside the rectangle template, and $R_{in,j}$ is the set of pixels inside the ellipse template (FIG. 10). The
regions $R_{in,j}$ and $R_{out,j}$ locally classify the image into "face" and "non face" regions. Based on the
above discussion, the likelihood function $P(O(x,y) \mid \lambda_i)$ for the model $\lambda_i$ is determined by the
ratio of the number of foreground pixels classified as "face" and background pixels classified as "non face" in
each area bounded by $x_{j-1}, x_j$ (where $j$ = 1 to $i$), over the total number of pixels in "face" and "non
face" regions (step 403). This is described in Equation (2) below:

$$P(O(x,y) \mid \lambda_i) = \frac{\sum_{j=1}^{i} \left( \sum_{(x,y) \in R_{in,j}} f(x,y) + \sum_{(x,y) \in R_{out,j}} b(x,y) \right)}{\sum_{j=1}^{i} \left( \sum_{(x,y) \in R_{in,j}} 1 + \sum_{(x,y) \in R_{out,j}} 1 \right)} \tag{2}$$

where

$$b(x,y) = \begin{cases} 1, & \text{if } p(O(x,y) \mid N_b(\mu,\sigma)) > T \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

and

$$f(x,y) = \begin{cases} 1, & \text{if } p(O(x,y) \mid N_b(\mu,\sigma)) < T \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
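The ratio described for step 403 can be illustrated with a short sketch. It assumes the foreground classification is already available as a boolean grid indexed `foreground[y][x]`, treats `a` and `b` as semi-axes, takes the rectangle to be `alpha * a` wide and `alpha * b` tall around the ellipse centre, and uses hypothetical names (`model_likelihood`, `slots`, `alpha`) throughout.

```python
def model_likelihood(foreground, slots, alpha=3.0):
    """Fraction of pixels whose foreground/background label agrees with the template.

    foreground: grid of booleans, foreground[y][x] is True for foreground pixels.
    slots: list of (x_left, x_right, (x0, y0, a, b)), one entry per person.
    alpha: assumed scale of the rectangular template relative to the ellipse axes.
    """
    rows, cols = len(foreground), len(foreground[0])
    agree = total = 0
    for x_left, x_right, (x0, y0, a, b) in slots:
        for y in range(rows):
            for x in range(max(0, x_left), min(cols, x_right)):
                in_rect = (abs(x - x0) <= alpha * a / 2.0
                           and abs(y - y0) <= alpha * b / 2.0)
                if not in_rect:
                    continue  # pixel belongs to neither R_in nor R_out
                in_ellipse = ((x - x0) / a) ** 2 + ((y - y0) / b) ** 2 <= 1.0
                # Counts f(x,y) inside the ellipse and b(x,y) outside it.
                if foreground[y][x] == in_ellipse:
                    agree += 1
                total += 1
    return agree / total if total else 0.0
```

A foreground region that exactly fills the ellipse scores 1.0, and the score drops as foreground pixels spill into the surrounding rectangle or background pixels appear inside the ellipse.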



The goal in steps 301-302 is not only to compute the likelihood functions $P(O(x,y) \mid \lambda_i)$ for a set of
parameters $(x_0, x_1, \ldots, x_i)$, but also to determine the set of parameters that maximize
$P(O(x,y) \mid \lambda_i)$. The initial parameters $(x_0, x_1, \ldots, x_i)$ for model $\lambda_i$ 301 are chosen to
uniformly segment the data, i.e. $x_j - x_{j-1} = (x_i - x_0)/i$ (where $j$ = 1 to $i$). As described in FIG. 4,
the parameters $(x_0, x_1, \ldots, x_i)$ are iteratively adjusted to maximize $P(O(x,y) \mid \lambda_i)$ (step 404).
The iterations are terminated if the difference of the likelihood functions in two consecutive iterations is
smaller than a threshold (step 405).
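The uniform initial segmentation can be written out directly. The function name `uniform_boundaries` and its arguments are hypothetical; the sketch only shows how equally spaced borders follow from the leftmost and rightmost coordinates of a foreground region.

```python
def uniform_boundaries(x_left, x_right, num_people):
    """Return the borders x_0..x_i that split [x_left, x_right] into equal slots."""
    if num_people < 1:
        raise ValueError("num_people must be at least 1")
    width = (x_right - x_left) / num_people
    return [x_left + j * width for j in range(num_people + 1)]


# A region spanning columns 10 to 70, modelled as three people.
borders = uniform_boundaries(10, 70, 3)
```

For the two person case this places the single adjustable border at the midpoint of the region.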

In a two person model in one embodiment, $x_1$ is the only parameter that is iteratively adjusted for the
estimation of the model. The computation of the likelihood function for a two person model is described in the
following steps. The reference numerals within [brackets] correspond to like-numbered steps illustrated in FIG. 11:

[1101] the initial values of $(x_0, x_1, x_2)$ are determined such that $x_0$ is the x coordinate of the leftmost
point of the foreground region, $x_2$ is the x coordinate of the rightmost point of the foreground region and
$x_1 = (x_0 + x_2)/2$.

[1102] the ellipse fitting process (step 402 in FIG. 4) is performed in each of the two vertical slots bounded by
the $(x_0, x_1)$ and $(x_1, x_2)$ pairs. The ellipse fitting algorithm will be described in more detail later with
respect to FIG. 5.

[1103] for the ellipses found, the following parameters are computed (step 403 in FIG. 4):

$$S_{in,j} = [\ldots] \tag{4A}$$

$$S_{out,j} = [\ldots] \tag{4B}$$



[1104] re-estimate the value of $x_1$ according to the following formula:

$$x_1^{(k+1)} = x_1^{(k)} + \mu \cdot \{(S_{in,0} - S_{out,0}) - (S_{in,1} - S_{out,1})\} \tag{4C}$$

where $\mu$ is a constant around 20.

[1105] compute $P(O \mid \lambda_2)$ from Equation (2). If the difference between $P(O \mid \lambda_2)$ for
consecutive values of parameter $x_1$ is smaller than a threshold, stop iterations. The parameters of the ellipses
given by the ellipse fitting algorithm, performed in each slot bounded by $(x_0, x_1)$ and $(x_1, x_2)$, will
determine the location and size of the people's heads in the foreground region. If the difference between
$P(O \mid \lambda_2)$ for consecutive values of parameter $x_1$ is bigger than the same threshold, then go to step
1102.

6. The Iterative Ellipse Fitting Algorithm

In step 402, the head within a video image is detected by iteratively fitting an ellipse around the upper portion
of the foreground region inside the area bounded by $x_{j-1}, x_j$ (where $j = 1, \ldots, i$). The objective in an
ellipse fitting algorithm is to find the $x_0$, $y_0$, $a$ and $b$ parameters of the ellipse such that:

$$\left(\frac{x - x_0}{a}\right)^2 + \left(\frac{y - y_0}{b}\right)^2 = 1 \tag{5}$$

A general prior art technique for fitting the ellipse around the detected blobs in step 402 (FIG. 4) is the use of
the Hough Transform, described by Chellappa et al. in "Human and Machine Recognition of Faces: A Survey",
Proceedings of the IEEE, vol. 83, no. 5, pp. 705-740, May 1995. However, the computational complexity of the Hough
Transform approach, as well as the need for a robust edge detection algorithm, make it ineffective for real-time
applications.

A better alternative for fitting the ellipse in step 402 (FIG. 4) is an inexpensive recursive technique that
reduces the search for the ellipse parameters from a four dimensional space $x_0, y_0, a, b$ to a one dimensional
space. The parameter space of the ellipse is reduced based on the following observations:

• The width of the ellipse at iteration $k + 1$ is equal to the distance between the right most and left most point
of the blob at the line corresponding to the current centroid position, i.e.

$$a^{(k+1)} = f_1(y_0^{(k)}) \tag{6}$$

where function $f_1$ is determined by the boundary of the objects resulting from the connected component analysis.

• The centroid of the ellipse is located on the so-called "vertical skeleton" of the blob representing the person.
The vertical skeleton is computed by taking the middle point between the left-most and the right-most points for
each line of the blob. The $x_0^{(k+1)}$ coordinate of the centroid of the ellipse at iteration $k+1$ is located on
the vertical skeleton at the line $y_0^{(k)}$ corresponding to the current centroid position. Hence $x_0^{(k+1)}$
will be uniquely determined as a function of $y_0^{(k)}$:

$$x_0^{(k+1)} = f_2(y_0^{(k)}) \tag{7}$$

where function $f_2$ is a function determined by the vertical blob skeleton.
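The two per-line measurements behind these observations, the blob width and the vertical skeleton, can be computed from the pixel list of a blob. In this sketch a blob is assumed to be a list of (y, x) pixel coordinates, and the helper names `row_extents`, `width_at` and `skeleton_x` are hypothetical stand-ins for the functions called $f_1$ and $f_2$ in the text.

```python
def row_extents(region):
    """Map each line y of the blob to its (leftmost x, rightmost x)."""
    extents = {}
    for y, x in region:
        lo, hi = extents.get(y, (x, x))
        extents[y] = (min(lo, x), max(hi, x))
    return extents


def width_at(extents, y):
    """Distance between the right-most and left-most blob points on line y."""
    left, right = extents[y]
    return right - left


def skeleton_x(extents, y):
    """Middle point between the left-most and right-most blob points on line y."""
    left, right = extents[y]
    return (left + right) / 2.0


# A tiny blob: three pixels on line 0 and the two end pixels of line 1.
blob = [(0, 3), (0, 4), (0, 5), (1, 2), (1, 6)]
extents = row_extents(blob)
```

Because both helpers depend only on the line index, the horizontal position and width of the ellipse follow from its vertical position alone, which is what reduces the search to one dimension.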

The $b$ parameter of the ellipse (the length) is generally very difficult to obtain with high accuracy due to the
difficulties in finding the chin line. However, generally the length to width ratio of the ellipse can be
considered constant, such as $M$. Then, from Equation (6):

$$b^{(k+1)} = M a^{(k+1)} = M f_1(y_0^{(k)}) \tag{8}$$

From Equation (5) we write:

$$y_0^{(k+1)} = F(x_0^{(k+1)}, a^{(k+1)}, b^{(k+1)}) \tag{9}$$

Equations (6), (7), (8) and (9) lead to:

$$y_0^{(k+1)} = G(y_0^{(k)}) \tag{10}$$

which describes the iterative ellipse-fitting algorithm of the present invention. Equation (10) indicates that we
have reduced the four-dimensional problem of finding the ellipse parameters to an implicit equation with one
unknown, $y_0$.

With this in mind, the ellipse fitting process is illustrated in further detail in FIG. 5. In step 503 the edges
and the vertical skeleton of the foreground regions in the area bordered by $x_{j-1}, x_j$ are extracted. After the
extraction of the skeletons of the foreground regions, the $y_0$ parameter of the ellipse is iteratively computed.

In one embodiment, the initial y coordinate of the ellipse centroid, $y_0^{(0)}$, is chosen close enough to the top
of the object on the vertical skeleton in order for the algorithm to perform well for all types of sequences, from
head-and-shoulder to full-body sequences (step 504). Typically the initial value of $y_0^{(0)}$ is selected
according to the following expression:

$$y_0^{(0)} = y_t + 0.1 \cdot (y_t - y_b) \tag{11}$$

where $y_t$ is the y coordinate of the highest point of the skeleton and $y_b$ is the y coordinate of the lowest
point of the skeleton. Given the initial point $y_0^{(0)}$, the ellipse fitting algorithm iterates through the
following loop to estimate the ellipse parameters. The reference numerals in [brackets] refer to the steps
illustrated in FIG. 12.

[1201] compute parameter $2a^{(k)}$ by measuring the distance between the left and the right edges of the blob.

[1202] compute parameter $b^{(k)}$ by measuring the y distance between $y_0^{(k)}$ and the highest point of the
skeleton.

[1203] compute the error $e^{(k)}$ (in step 505):

$$e^{(k)} = b^{(k)} - M a^{(k)} \tag{12}$$

In sum, the goal of the ellipse fitting algorithm described herein is to minimize this value, i.e. to find the
ellipse that best satisfies the condition $b = Ma$, $M = 1.4$.

[1204] compute the new value $y_0^{(k+1)}$ (step 506) using a linear estimate given by the following equation:

$$y_0^{(k+1)} = y_0^{(k)} + \mu e^{(k)} \tag{12A}$$

[1205] if the distance between two consecutive centroids is smaller than a threshold, stop the iterations. When the
iterations stop, $x_0$, $y_0$, $a$ and $b$ describe the four parameters of the ellipse.

Otherwise, go to step 1203.
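The loop of steps 1201 to 1205 can be sketched as a one-dimensional fixed-point iteration. The sketch makes several assumptions that are not stated in the text: the y axis points up (so the top of the head has the largest y), the blob is described by two hypothetical callables `half_width(y)` and `skeleton_x(y)`, the measurements of steps 1201-1202 are repeated on every pass, and `tol` and `max_iter` are illustrative stopping parameters.

```python
import math


def fit_ellipse(half_width, skeleton_x, y_top, y_init, M=1.4, mu=1.0,
                tol=1e-6, max_iter=100):
    """Iterate the centroid height until consecutive centroids nearly coincide.

    half_width(y): half the blob width on line y (the parameter a).
    skeleton_x(y): x coordinate of the vertical skeleton on line y.
    y_top: y coordinate of the highest point of the skeleton.
    Returns (x0, y0, a, b) with b = M * a.
    """
    y = y_init
    for _ in range(max_iter):
        a = half_width(y)            # [1201] half of the measured width 2a
        b = y_top - y                # [1202] distance from centroid to the top
        error = b - M * a            # [1203] Equation (12)
        y_next = y + mu * error      # [1204] Equation (12A)
        converged = abs(y_next - y) < tol
        y = y_next
        if converged:                # [1205] stop when the centroid settles
            break
    a = half_width(y)
    return skeleton_x(y), y, a, M * a


# Synthetic check on an ideal ellipse centred at (50, 100) with a = 10, b = 14.
X0, Y0, A, RATIO = 50.0, 100.0, 10.0, 1.4


def ideal_half_width(y):
    return A * math.sqrt(max(0.0, 1.0 - ((y - Y0) / (RATIO * A)) ** 2))


result = fit_ellipse(ideal_half_width, lambda y: X0, y_top=Y0 + RATIO * A,
                     y_init=110.0, M=RATIO)
```

On this synthetic contour the iteration settles on the true centre within a handful of passes; on real blobs the measurements are discrete, so a pixel-scale tolerance would be more appropriate.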

The above iterations converge to the ellipse parameters for an ellipse type contour. From equation (5), the
distance between the right most and left most point of the ellipse corresponding to $y_0^{(k)}$ is determined by:

$$2a^{(k)} = 2a\sqrt{1 - \left(\frac{y_0^{(k)} - y_0}{Ma}\right)^2} \tag{13}$$

and the distance between the top of the ellipse and $y_0^{(k)}$ is determined by

$$b^{(k)} = y_0 + Ma - y_0^{(k)} \tag{14}$$

Hence, for $\mu = 1$, equation (12A) becomes:

$$y_0^{(k+1)} - y_0 = Ma - Ma\sqrt{1 - \left(\frac{y_0^{(k)} - y_0}{Ma}\right)^2} \tag{15}$$

From the above equation it can be proved that

$$|y_0^{(k+1)} - y_0| < |y_0^{(k)} - y_0|$$

for any $y_0^{(k)}$ for which $|y_0^{(k)} - y_0| < Ma$. This shows that the recurrence defined in equation (10)
converges to $y_0$.

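The convergence claim can be checked with a short derivation. The normalised offset $d_k$ is notation introduced here for illustration, and the recurrence is the one labelled (15) with $\mu = 1$; this is a sketch of the argument rather than the proof given by the authors.

```latex
\begin{aligned}
d_k &= \frac{y_0^{(k)} - y_0}{Ma}, \qquad d_{k+1} = 1 - \sqrt{1 - d_k^2}, \qquad 0 < |d_k| < 1, \\
(1 - |d_k|)^2 &= 1 - 2|d_k| + d_k^2 < 1 - d_k^2
  \quad\Longleftrightarrow\quad d_k^2 < |d_k|
  \quad\Longleftrightarrow\quad |d_k| < 1, \\
\text{hence}\quad 1 - |d_k| &< \sqrt{1 - d_k^2}
  \quad\Longrightarrow\quad 0 \le d_{k+1} < |d_k| .
\end{aligned}
```

The offsets therefore form a decreasing sequence bounded below by zero, and since $1 - \sqrt{1 - d^2} \approx d^2/2$ for small $d$, the offset shrinks roughly quadratically once it is small.
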
7. Eye Detection Process 



The ellipses detected from stage 201, and as described previously, are potentially the region of support for human
faces. After the detection of these regions, a more refined model for the face is required in order to determine
which of the detected regions correspond to valid faces. The use of the eye detection process of stage 202, in
conjunction with the head detection stage 201, improves the accuracy of the head model and removes regions
corresponding to back views of faces or other regions that do not correspond to a face. Eye detection results can
also be used to estimate the face pose and to determine the image containing the most frontal poses among a
sequence of images. This result may then be used in recognition and classification systems.

The present invention may use an eye-detection algorithm based on both region size and geometrical measure
filtering. The exclusive use of geometrical measures to detect the eyes inside a rectangular window around the
ellipse centroid (eye band: $W_{eye}$ 1001 in FIG. 10) may lead to problems in the analysis of non-frontal faces.
In these cases, the hair inside the eye band generates small regions that are not connected to each other and that
are in general close in size and intensity to the eye regions. Under the assumption of varying poses, the simple
inspection of geometrical distances between regions and positions inside the eye band cannot indicate which
regions correspond to the eyes. Hence, a more difficult approach based on region shape could be taken into
account. However, in the present invention, a simple method may be implemented to discriminate eye and hair
regions that performs with good results for a large number of video image sequences. In this approach, the small
hair regions inside the eye band are removed by analyzing the region sizes in a larger window around the upper
portion of the face ($W_{face-up}$ 1002 in FIG. 10). Inside this window, the hair corresponds to the region of
large size.

Stage 202 of FIG. 2 illustrates the steps of the eye detection approach that may be used according to the present
invention. In step 221, the pixel intensities inside the face regions are compared to a threshold $\theta$, and
pixels with intensities lower than $\theta$ are extracted from the face region. In step 222, and as shown in FIG.
7A, the connectivity analysis of the extracted pixels generates connected sets of pixels (e.g., pixels 701), i.e.
sets of pixels that are adjacent or touching. Each of these connected sets of pixels 701 describes a low intensity
region of the face.

In step 223, the pixel regions 701 resulting from steps 221-222 are filtered with respect to the region size.
Regions having a small number of pixels due to camera noise or shadows are removed. Large regions generally cannot
represent eyes, but instead correspond in general to hair. The size of the regions selected at this stage is in
the interval $[\theta_m, \theta_M]$, where $\theta_m$ is the minimum and $\theta_M$ is the maximum number of pixels
allowed by our system to describe a valid eye region. Threshold values $\theta_m, \theta_M$ are determined based on
the size of the ellipse that characterizes the head region (the ellipse being generated iteratively in step 215).
The end result of step 223 is an image 702, such as that shown in FIG. 7B.

In step 224, the remaining components within the image of FIG. 7B are filtered based on anthropological measures,
such as the geometrical distances between eyes and the expected position of the eyes inside a rectangular window
(eye band) centered in the ellipse centroid. The eye regions are determined by analyzing the minimum and maximum
distance between the regions inside this band. The output 230 of step 224 is an image, such as shown in FIG. 7C,
whereby the eyes 703 have been detected.
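The size filter of step 223 and the distance filter of step 224 can be sketched as below. Regions are assumed to be lists of (y, x) pixel coordinates produced by an earlier connectivity analysis; the function names, the eye band limits and the distance limits are hypothetical parameters chosen for illustration, not values from the text.

```python
import math


def filter_by_size(regions, theta_min, theta_max):
    """Step 223: keep regions whose pixel count lies in [theta_min, theta_max]."""
    return [r for r in regions if theta_min <= len(r) <= theta_max]


def centroid(region):
    """Mean (y, x) position of a region's pixels."""
    n = len(region)
    return (sum(y for y, _ in region) / n, sum(x for _, x in region) / n)


def find_eye_pair(regions, band_top, band_bottom, d_min, d_max):
    """Step 224: return two region centroids inside the eye band whose
    separation lies in [d_min, d_max], or None when no such pair exists."""
    centers = [c for c in map(centroid, regions)
               if band_top <= c[0] <= band_bottom]
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            dy = centers[i][0] - centers[j][0]
            dx = centers[i][1] - centers[j][1]
            if d_min <= math.hypot(dy, dx) <= d_max:
                return centers[i], centers[j]
    return None
```

A back view such as FIG. 8B would leave no qualifying pair, so `find_eye_pair` returning `None` corresponds to "no eye is detected".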

The present invention may be implemented on a variety of different video sequences from camera 110. FIGS. 8A,
8B, 8C and 8D depict the results obtained by operating the present invention in a sample laboratory environment,
based upon the teachings above. FIGS. 8A-8D comprise four different scenarios generated to demonstrate the
performance under different conditions such as non-frontal poses, multiple occluding people, back views, and faces
with glasses. In FIG. 8A, the face 812 of a single person 811 is detected, via ellipse 813. In this figure, the
ellipse 813 is properly fitted around the face 812 and the eyes 814 are detected even though the person 811 is
wearing optical glasses on his face 812.

FIG. 8B shows the back view of a single person 821 in the video scene. In this figure, the ellipse 823 is fitted around 
the head of the person 821, but no eye is detected, indicating the robustness of the eye detection stage 202 of the 
present invention. 

FIGS. 8C and 8D show two scenarios in which two people 831A and 831B are present in the scene. In both figures 
the body of one person 831B is covering part of the body of the other person 831A. In both cases, ellipses 833A and 
833B are positioned around the faces 832A and 832B, and eyes 834A and 834B are detected. In FIG. 8D, the face 
832A of the person 831A in the back has a non-frontal position. Also, due to different distances from the camera 110, 
the sizes of the two faces 832A and 832B are different. The faces 832A and 832B of both persons 831A and 831B are 
detected, indicating the robustness of the system to variations in parameters such as size and position of the faces 
832A and 832B. 

Although the present invention has been described with particular reference to certain preferred embodiments 
thereof, variations and modifications of the present invention can be effected within the spirit and scope of the following 
claims. 



Claims 

1. A system for processing a video image comprising pixels representing a foreground including one or more human 
faces, the system comprising: 

component analysis means (212,213) to process the pixels of the image to identify a region of connected 
components in the foreground of the image, 

ellipse fitting means (503,504,505,506,507) to perform an iterative ellipse fitting algorithm to fit one or more 
ellipses to the connected components in the identified region, 

means (301) to provide a model of borders for one or more separate individual faces in the identified region, 

probability computing means (403) to perform a computation of the probability of the model based on the 
ellipse or ellipses fitted in the identified region, and 

means (404,405) to iteratively adjust the model to maximise the probability computation. 

2. A system as claimed in claim 1, further comprising means (221) to identify sub-regions within the identified region 
that are below a selected low intensity threshold, means (223) to filter out those of the sub-regions that are below 
a selected small size or above a selected large size, and 

means (224) to filter the remaining sub-regions based on anthropological measures to derive co-ordinates 
representing eyes. 

3. A system as claimed in claim 1 or 2, in which the component analysis means (212,213) includes background 
subtraction and thresholding means (212) to subtract pixels representing background in the video image. 

4. A system as claimed in claim 1, 2 or 3, in which the ellipse fitting means (503,504,505,506,507) include means 
(503) to detect edges and vertical skeleton lines in the identified region, means (504) to form an initial centroid 
estimation, means (505) to compute the error in the centroid estimation, means (506) to compute a new centroid 
estimation and means (507) to stop the iteration when the distance between two centroid estimates is smaller than 
a predetermined threshold. 

5. A system as claimed in claim 1, 2, 3 or 4, in which the means (301) to provide a model of borders for one or more 
separate individual faces is effective to provide a selection of such models, each such model comprising a different 
selection of vertical borders. 

6. A system as claimed in claim 5, in which the probability computing means (403) are operable to compute the 
probability of each of a plurality of models, means being provided to select the model for which the highest 
probability is computed. 

7. A system as claimed in any one of the preceding claims, further comprising a video camera (110) to generate the 
said video image and storage means (150) to store the video image. 

8. A system as claimed in any one of the preceding claims, further comprising recognition and classification means 
to process the model to detect a face within the video image. 